Abstract

This paper shows how sparse, high-dimensional probability distributions could
be represented by neurons with exponential compression. The representation is a
novel application of compressive sensing to sparse probability distributions rather
than to the usual sparse signals. The compressive measurements correspond to
expected values of nonlinear functions of the probabilistically distributed variables.
When these expected values are estimated by sampling, the quality of the
compressed representation is limited only by the quality of sampling. Since the
compression preserves the geometric structure of the space of sparse probability
distributions, probabilistic computation can be performed in the compressed domain.
Interestingly, functions satisfying the requirements of compressive sensing
can be implemented as simple perceptrons. If we use perceptrons as a simple
model of feedforward computation by neurons, these results show that the mean
activity of a relatively small number of neurons can accurately represent a
high-dimensional joint distribution implicitly, even without accounting for any noise
correlations. This constitutes a novel hypothesis for how neurons could encode
probabilities in the brain.



1 Introduction 

An arbitrary probability distribution over multiple variables has a parameter count that is exponential 
in the number of variables. Representing these probabilities can therefore be prohibitively costly. 
One common approach is to use graphical models to parameterize the distribution in terms of a 
smaller number of interactions. Here I consider an alternative approach. In many cases of interest,
only a few unknown states have high probabilities while the rest have negligible ones; such a
distribution is called 'sparse'. I will show that sufficiently sparse distributions can be described by a number
of parameters that is merely linear in the number of variables.

Until recently, it was generally thought that encoding of sparse signals required dense sampling at a
rate greater than or equal to signal bandwidth. However, recent findings prove that it is possible to
fully characterize a signal at a rate limited not by its bandwidth but by its information content [1] [2]
[3] [4], which can be much smaller. Here I apply such compression to sparse probability distributions
over binary variables, which, after all, are just signals with some particular properties.

Traditional compressive sensing considers signals that live in an N-dimensional space but have
only S nonzero coordinates in some basis. We say that such a signal is S-sparse. If we were told
the location of the nonzero entries, then we would need only S measurements to characterize their 
coefficients and thus the entire signal. But even if we don't know where those entries are, it still 
takes little more than S linear measurements to perfectly reconstruct the signal. Furthermore, those 
measurements can be fixed in advance without any knowledge of the structure of the signal. Under
certain conditions, these excellent properties can be guaranteed [1] [2] [3].

The basic mathematical setup of compressive sensing is as follows. Assume that an N-dimensional
signal s has S nonzero coefficients. We make M linear measurements y of this signal by applying
the M × N matrix A:

y = As (1) 

We would then like to recover the original signal s from these measurements. Under certain conditions
on the measurement matrix A described below, the original can be found perfectly by computing
the vector with minimal ℓ1 norm that reproduces the measurements,

\hat{s} = \arg\min_{s'} \|s'\|_{\ell_1} \quad \text{such that} \quad A s' = y = A s \qquad (2)

This is a powerful, practical result because (2) can be solved efficiently [1] [2] [3] [5].

Compressive sensing is generally robust to two deviations from this ideal setup. First, target signals
may not be strictly S-sparse. However, they may be 'compressible' in the sense that they are well
approximated by an S-sparse signal. Signals whose rank-ordered coefficients fall off at least as fast
as rank^{-1} satisfy this property [2]. Second, measurements may be corrupted by noise with bounded
amplitude ε. Under these conditions, the error of the ℓ1-reconstructed signal ŝ is bounded by the
error of the best S-sparse approximation s_S plus a term proportional to the measurement noise:

\|\hat{s} - s\|_{\ell_2} \le C_0 \|s_S - s\|_{\ell_1} / \sqrt{S} + C_1 \epsilon \qquad (3)

for some constants C_0 and C_1 [6].

Several conditions on A have been used in compressive sensing to guarantee good performance
[3] [4] [7] [8] [9]. Modulo various nuances, they all essentially ensure that most or all relevant sparse
signals lie sufficiently far from the null space of A: it would be impossible to recover signals in the
null space since their measurements are all zero and therefore cannot be distinguished. The most
commonly used condition is the Restricted Isometry Property (RIP), which says that A preserves ℓ2
norms of all S-sparse vectors within a factor of 1 ± δ_S that depends on the sparsity,

(1 - \delta_S) \|s\|_{\ell_2} \le \|A s\|_{\ell_2} \le (1 + \delta_S) \|s\|_{\ell_2} \qquad (4)

If A satisfies the RIP with small enough δ_S, then ℓ1 recovery is guaranteed to succeed. For random
matrices whose elements are independent and identically distributed Gaussian or Bernoulli variates,
the RIP holds as long as the number of measurements M satisfies

M \ge C S \log(N/S) \qquad (5)

for some constant C that depends on δ_S [6]. No other recovery method, however intractable, can
perform substantially better than this [6].

2 Compressing sparse probability distributions 

Compressive sensing allows us to use far fewer resources to accurately represent high-dimensional 
objects if they are sufficiently sparse. Even if we don't ultimately intend to reconstruct the signal, the
reconstruction theorem described above (3) ensures that we have implicitly represented all the relevant
information. This compression proves to be extremely useful when representing multivariate
joint probability distributions, whose size is exponentially large even for the simplest binary states. 

Consider the signal to be a probability distribution over an n-dimensional binary vector x ∈
{−1, +1}^n, which I will write sometimes as a function p(x) and sometimes as a vector p indexed
by the binary state x. I assume p is sparse in the canonical basis of delta-functions on each state,
δ_{x,x'}. The dimensionality of this signal is N = 2^n, which for even modest n can be so large it
cannot be represented explicitly.

The measurement matrix A for probability vectors has size M × 2^n. Each row corresponds to a
different measurement, indexed by i. Each column corresponds to a different binary state x. This
column index x ranges over all possible binary vectors of length n, in some conventional sequence.
For example, if n = 3 then the column index would take the 8 values

x ∈ {−−−, −−+, −+−, −++, +−−, +−+, ++−, +++}

Each element of the measurement matrix, A_i(x), can be viewed as a function applied to the binary
state. When this matrix operates on a probability distribution p(x), the result y is a vector of M
expectation values of those functions, with elements

y_i = A_i p = \sum_x A_i(x) p(x) = \langle A_i(x) \rangle_{p(x)} \qquad (6)



For example, if A_i(x) = x_i then y_i = ⟨x_i⟩_{p(x)} measures the mean of x_i drawn from p(x).
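As a minimal sketch of how such expectation measurements can be evaluated when the distribution is small enough to enumerate, the following Python snippet computes y_i = Σ_x A_i(x) p(x) for the coordinate functions A_i(x) = x_i. The two-state distribution `p` and the helper name `expectation_measurements` are hypothetical and chosen only for illustration.

```python
from itertools import product

def expectation_measurements(p, funcs, n):
    """Return y_i = sum_x A_i(x) p(x) for each measurement function A_i.

    p maps binary states (tuples of -1/+1) to probabilities; states that
    are absent from the dict are treated as having probability zero.
    """
    states = list(product((-1, +1), repeat=n))
    return [sum(f(x) * p.get(x, 0.0) for x in states) for f in funcs]

# Hypothetical sparse distribution over n = 3 binary variables.
p = {(+1, +1, -1): 0.7, (-1, +1, +1): 0.3}

# A_i(x) = x_i, so y_i is the mean of the i-th variable under p.
coordinate_funcs = [lambda x, i=i: x[i] for i in range(3)]
y = expectation_measurements(p, coordinate_funcs, n=3)
print(y)  # approximately [0.4, 1.0, -0.4], up to floating-point rounding
```

Explicit enumeration of all 2^n states is only feasible for small n, which is the motivation for estimating these expectations from samples instead.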

For suitable measurement matrices A, we are guaranteed accurate reconstruction of S-sparse
probability distributions as long as the number of measurements is

M \ge O(S \log(N/S)) = O(S n - S \log S) \qquad (7)

Note that the exponential size of the probability vector, N = 2^n, is cancelled by the logarithm. For
distributions with a fixed sparseness S, the required number of measurements per variable, M/n, is
then independent of the number of variables.¹
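The cancellation of the exponential by the logarithm can be checked numerically. This short Python sketch evaluates S log(N/S) for N = 2^n, leaving out the unspecified constant, and prints the value per variable; the sparseness S = 8 and the function name `measurement_scale` are illustrative assumptions.

```python
import math

def measurement_scale(S, n):
    """S * log(N / S) for N = 2**n, omitting the unspecified constant."""
    # log(2**n / S) = n*log(2) - log(S); written this way to avoid forming 2**n.
    return S * (n * math.log(2) - math.log(S))

S = 8  # hypothetical fixed sparseness
for n in (10, 100, 1000):
    print(n, measurement_scale(S, n) / n)
```

As n grows, the printed ratio approaches S log 2, so the number of measurements per variable stays bounded even though N = 2^n grows exponentially.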

In many cases of interest it is impractical to calculate these expectation values directly: recall that
the probabilities may be too expensive to represent explicitly in the first place. One remedy is to
draw T samples x_t from the distribution p(x), and use a sum over these samples to approximate the
expectation values,

y_i \approx \frac{1}{T} \sum_t A_i(x_t), \qquad x_t \sim p(x) \qquad (8)

The probability p(x) estimated from T samples has errors with variance p(x)(1 − p(x))/T, which
is bounded by 1/(4T). This allows us to use the performance limits from robust compressive sensing,
which according to (3) creates an error in the reconstructed probabilities that is bounded by

\|\hat{p} - p\|_{\ell_2} \le C_0 \|p_S - p\|_{\ell_2} + \frac{C_1}{\sqrt{T}} \qquad (9)

where p_S is a vector with the top S probabilities preserved and the rest set to zero.

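The sampling approximation can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the two-state distribution, the coordinate measurement functions A_i(x) = x_i, and the helper name `sample_measurements` are hypothetical. The estimation error is expected to shrink roughly as 1/√T.

```python
import random

def sample_measurements(states, probs, funcs, T, rng):
    """Estimate y_i by averaging A_i(x_t) over T samples x_t ~ p(x)."""
    samples = rng.choices(states, weights=probs, k=T)
    return [sum(f(x) for x in samples) / T for f in funcs]

rng = random.Random(0)
states = [(+1, +1, -1), (-1, +1, +1)]   # support of a hypothetical sparse p(x)
probs = [0.7, 0.3]
funcs = [lambda x, i=i: x[i] for i in range(3)]
exact = [0.4, 1.0, -0.4]                # exact expectations for this p(x)

for T in (100, 10000):
    est = sample_measurements(states, probs, funcs, T, rng)
    print(T, max(abs(a - b) for a, b in zip(est, exact)))
```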
2.1 Measurements by random perceptrons 

In compressive sensing it is common to use a matrix with independent Bernoulli-distributed random
values A_i(x), which guarantees that A satisfies the RIP [10]. Each row of this matrix represents
all possible outputs of an arbitrarily complicated Boolean function of the n binary variables x.

Biological neural networks would have great difficulty computing such arbitrary functions in a simple
manner. However, neurons can easily compute a large class of simpler Boolean functions, the
perceptrons. These are simple threshold functions of a weighted average of the input

A_i(x) = \mathrm{sgn}\left( \sum_j W_{ij} x_j - \theta_i \right) \qquad (10)

where W is an M × n matrix. Here I take W to have elements drawn randomly from a standard
normal distribution, W_ij ~ N(0, 1), and call the resultant functions 'random perceptrons'. An
example measurement matrix for random perceptrons is shown in Figure 1. These functions are
readily implemented by individual neurons, where x_j is the instantaneous activity of neuron j,
W_ij is the synaptic weight between neurons i and j, and the sgn function approximates a spiking
threshold at θ.
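For concreteness, a random perceptron measurement matrix can be built explicitly for small n. The Python sketch below draws Gaussian weights, enumerates all 2^n states, and applies a common threshold; the sizes M = 5 and n = 3, the convention sgn(0) = +1, and the function names are assumptions made for this illustration.

```python
import random
from itertools import product

def sgn(u):
    """Sign function with the convention sgn(0) = +1."""
    return 1 if u >= 0 else -1

def random_perceptron_matrix(M, n, rng, theta=0.0):
    """Build the M x 2**n matrix A_i(x) = sgn(sum_j W_ij x_j - theta)."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(M)]
    states = list(product((-1, +1), repeat=n))
    A = [[sgn(sum(w_j * x_j for w_j, x_j in zip(w, x)) - theta) for x in states]
         for w in W]
    return W, states, A

W, states, A = random_perceptron_matrix(M=5, n=3, rng=random.Random(0))
for row in A:
    # One row per perceptron, one column per binary state.
    print("".join("+" if a > 0 else "-" for a in row))
```

With a zero threshold each row is an odd function of the state, so the column for a state x is the negative of the column for −x.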

The step nonlinearity sgn is not essential, but some type of nonlinearity is. Using a simple linear
function of the states, A(x) = Wx, would result in measurements y = Ap = W⟨x⟩. This provides
at most n linearly independent measurements of p(x), even when M > n. In most cases this is not
enough to adequately capture the full distribution.
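A tiny worked example makes the need for a nonlinearity concrete. In this Python sketch, two different distributions share the same mean ⟨x⟩, so a linear measurement cannot tell them apart, while a perceptron with the same weights can. The distributions `p1`, `p2` and the weight vector `w` are hypothetical.

```python
def sgn(u):
    """Sign function with the convention sgn(0) = +1."""
    return 1 if u >= 0 else -1

def mean_under(p, f):
    """Expected value of f(x) under the distribution p (dict: state -> prob)."""
    return sum(prob * f(x) for x, prob in p.items())

# Two distinct distributions with the same mean vector (0, 1, 0).
p1 = {(+1, +1, -1): 0.5, (-1, +1, +1): 0.5}
p2 = {(+1, +1, +1): 0.5, (-1, +1, -1): 0.5}
w = (1.0, 1.0, 1.0)  # hypothetical weight vector

def linear(x):
    return sum(wj * xj for wj, xj in zip(w, x))

def perceptron(x):
    return sgn(linear(x))

print(mean_under(p1, linear), mean_under(p2, linear))          # 1.0 1.0
print(mean_under(p1, perceptron), mean_under(p2, perceptron))  # 1.0 0.0
```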

Although the dimensionality of W is merely M × n, which is much smaller than the 2^n-dimensional
space of probabilities, (10) can generate O(2^{n^2}) distinct perceptrons [11]. By including an appropriate
threshold, a perceptron can assign any individual state x a positive response and assign a negative
response to every other state. This shows that random perceptrons generate the canonical basis and
can thus span the space of possible p(x). In what follows, I assume that θ = 0 for simplicity.

Figure 1: Example measurement matrix A_i(x) for M = 100 random perceptrons applied to all 2^9
possible binary vectors of length n = 9. (Axis label: State vector.)

¹ Depending on the problem, the number of significant nonzero entries S may grow with the number of
variables. This growth may be fast (e.g. the number of possible patterns grows as e^n) or slow (e.g. the number
of possible translations of a given pattern grows only as n).

Below I present empirical evidence that a small number of random perceptrons preserves the
information about sparse distributions. In the Appendix I show that random perceptrons with zero
threshold are incoherent with the canonical basis, and the columns of the random perceptron measurement
matrix are asymptotically orthogonal in the limit of large n. Random perceptrons thus satisfy
the requirements for RIPless compressive sensing [4]. Present research is directed toward deriving
the condition number of these matrices for finite n, in order to provide rigorous bounds on the
number of measurements required in practice.

3 Experiments 

3.1 Fidelity of compressed sparse distributions 

To test random perceptrons in compressive sensing of probabilities, I generated sparse distributions
using small Boltzmann machines [12], and compressed them using random perceptrons driven by
samples from the Boltzmann machine. Performance was then judged by ℓ1 reconstructions.

In a Boltzmann machine, binary states x occur with probabilities given by the Boltzmann distribution

p(x) \propto e^{-E(x)} \qquad (11)

for an energy function

E(x) = -b^\top x - x^\top J x \qquad (12)

determined by biases b and pairwise couplings J. Sampling from this distribution can be accomplished
by running Glauber dynamics [13], at each time step turning a unit on with probability
p(x_i = +1 | x_{\setminus i}) = 1/(1 + e^{ΔE}), where ΔE = E(x_i = +1, x_{\setminus i}) − E(x_i = −1, x_{\setminus i}). Here x_{\setminus i}
is the vector of all components of x except the ith.
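The sampling procedure can be sketched in plain Python for a very small Boltzmann machine, where the exact distribution is available for comparison. The conditional update used here follows from p(x) ∝ e^{−E(x)}: a unit is set to +1 with probability 1/(1 + e^{ΔE}), with ΔE the energy of the "on" state minus the energy of the "off" state. The three-unit couplings, the number of sweeps, and the function names are hypothetical choices for illustration, not the settings used in the experiments.

```python
import math
import random
from itertools import product

def energy(x, b, J):
    """E(x) = -b.x - x.J.x for a state x with entries -1/+1."""
    n = len(x)
    lin = sum(b[i] * x[i] for i in range(n))
    quad = sum(x[i] * J[i][j] * x[j] for i in range(n) for j in range(n))
    return -lin - quad

def glauber_sweep(x, b, J, rng):
    """Resample every unit once from its conditional distribution (in place)."""
    for i in range(len(x)):
        x[i] = +1
        e_on = energy(x, b, J)
        x[i] = -1
        e_off = energy(x, b, J)
        p_on = 1.0 / (1.0 + math.exp(e_on - e_off))
        x[i] = +1 if rng.random() < p_on else -1
    return x

def exact_distribution(b, J):
    """Enumerate all states and normalize exp(-E(x)); feasible only for small n."""
    states = list(product((-1, +1), repeat=len(b)))
    weights = [math.exp(-energy(s, b, J)) for s in states]
    Z = sum(weights)
    return {s: w / Z for s, w in zip(states, weights)}

rng = random.Random(0)
b = [0.0, 0.0, 0.0]
J = [[0.0, 0.3, -0.2],
     [0.3, 0.0, 0.4],
     [-0.2, 0.4, 0.0]]  # hypothetical symmetric couplings
x = [1, 1, 1]
counts = {}
sweeps = 20000
for _ in range(sweeps):
    glauber_sweep(x, b, J, rng)
    counts[tuple(x)] = counts.get(tuple(x), 0) + 1

exact = exact_distribution(b, J)
max_err = max(abs(counts.get(s, 0) / sweeps - q) for s, q in exact.items())
print(max_err)  # largest gap between empirical and exact state probabilities
```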

For simulations I distinguished between two types of units, hidden and visible, x = (h, v). On
each trial I first generated a sample of all units according to (11). I then fixed only the visible
units and allowed the hidden units to fluctuate according to the conditional probability p(h|v) to be
represented. This probability is given again by the Boltzmann distribution, now with energy function

E(h|v) = -(b_h + J_{hv} v)^\top h - h^\top J_{hh} h \qquad (13)

All bias terms b were set to zero, and all pairwise couplings J were random draws from a zero-mean
normal distribution. Experiments used n hidden and n visible units, with
n ∈ {8, 10, 12}. This distribution of couplings produced sparse posterior distributions whose
rank-ordered probabilities fell faster than rank^{-1} and were thus compressible [2].

The compression was accomplished by passing the hidden unit activities h through random perceptrons
a with weights W, according to a = sgn(Wh). These perceptron activities fluctuate
along with their inputs. The mean activity of these perceptron units compressively senses the
probability distribution according to (8). This process of sampling and then compressing a Boltzmann
distribution can be implemented by the simple neural network shown in Figure 2.

Figure 2: Compressive sensing of a probability distribution by model neurons. Left: a neural architecture
for generating and then encoding a sparse, high-dimensional probability distribution. Right:
activity of each population of neurons as a function of time. Sparse posterior probability distributions
are generated by a Boltzmann machine with visible units v (Inputs), hidden units h (Samplers),
feedforward couplings J_vh from visible to hidden units, and recurrent connections J_hh between hidden
units. The visible units' activities are fixed by an input. The hidden units are stochastic, and
sample from a probability distribution p(h|v). The samples are recoded by feedforward weights W
to random perceptrons a. The mean activity y of the time-dependent perceptron responses captures
the sparse joint distribution of the hidden units.

We are not ultimately interested in reconstruction of the large sparse distribution, but rather its
compressed representation. Nonetheless, reconstruction is useful to show that the information has been
preserved. I reconstruct sparse probabilities using nonnegative ℓ1 minimization with measurement
constraints [14] [15], minimizing

\|p\|_{\ell_1} + \lambda \|A p - y\|_{\ell_2}^2 \qquad (14)

where λ is a regularization parameter that was set to 2T in all simulations. Reconstructions were
quite good, as shown in Figure 3. Even with far fewer measurements than signal dimensions,
reconstruction accuracy is limited only by the sampling of the posterior. With enough random
perceptrons, no available information is lost.
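One simple way to minimize an objective of this form under a nonnegativity constraint is projected gradient descent. The Python sketch below is not the solver used for the experiments (the text does not specify one); it is a swapped-in illustration with hypothetical sizes (n = 4, M = 10), a hypothetical regularization value, noiseless measurements, and a conservative fixed step size.

```python
import random
from itertools import product

def sgn(u):
    return 1 if u >= 0 else -1

def apply(A, p):
    """Matrix-vector product A p for A given as a list of rows."""
    return [sum(a * q for a, q in zip(row, p)) for row in A]

def objective(p, A, y, lam):
    """sum(p) + lam * ||A p - y||^2; sum(p) is the l1 norm when p >= 0."""
    resid = [u - yi for u, yi in zip(apply(A, p), y)]
    return sum(p) + lam * sum(r * r for r in resid)

def reconstruct(A, y, lam, iters=3000):
    """Projected gradient descent on the objective, keeping p >= 0."""
    M, N = len(A), len(A[0])
    # Entries of A are +-1, so M*N bounds the largest eigenvalue of A^T A;
    # this gives a conservative step size that cannot increase the objective.
    step = 1.0 / (2.0 * lam * M * N)
    p = [0.0] * N
    for _ in range(iters):
        resid = [u - yi for u, yi in zip(apply(A, p), y)]
        grad = [1.0 + 2.0 * lam * sum(A[i][k] * resid[i] for i in range(M))
                for k in range(N)]
        p = [max(0.0, pk - step * gk) for pk, gk in zip(p, grad)]
    return p

rng = random.Random(0)
n, M, lam = 4, 10, 50.0             # hypothetical sizes and regularization
states = list(product((-1, +1), repeat=n))
W = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(M)]
A = [[sgn(sum(w * x for w, x in zip(row, s))) for s in states] for row in W]

p_true = [0.0] * len(states)
p_true[3], p_true[9] = 0.7, 0.3     # hypothetical 2-sparse distribution
y = apply(A, p_true)                # exact expectations, no sampling noise
p_hat = reconstruct(A, y, lam)
print(round(sum(p_hat), 3), [round(q, 2) for q in p_hat])
```

When recovery succeeds, the largest entries of `p_hat` should sit on the support of `p_true`, and their sum is expected to fall slightly short of 1 because the ℓ1 penalty shrinks the estimate. Note that the two support states were deliberately chosen not to be negations of each other: zero-threshold perceptrons are odd functions of the state, so they cannot separate probability mass on x from a deficit of mass on −x.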

In the context of probability distributions, ℓ1 reconstruction has a serious flaw: all distributions have
the same ℓ1 norm, ||p||_{ℓ1} = Σ_x p(x) = 1. Minimizing the ℓ1 norm therefore yields an estimate that is
not a properly normalized probability distribution. Nonetheless, the individual probabilities of the most significant
states are accurately reconstructed, and only the highly improbable states are set to zero. Figure 3B
shows that the shortfall is small: ℓ1 reconstruction recovers over 90% of the total probability mass.

3.2 Preserving computationally important relationships 

There is value in being able to compactly represent these high-dimensional objects. However, it 
would be especially useful to perform probabilistic computations using these representations, such 
as marginalization and evidence integration. Since marginalization is a linear operation on the
probability distribution, this is readily implementable in the linearly compressed domain. In contrast,
evidence integration is a multiplicative process acting in the canonical basis, so this operation will
be more complicated after the linear distortions of compressive measurement A. Nonetheless, such
computations should be feasible as long as the informative relationships are preserved in the
compressed space: similar distributions should have similar compressive representations, and dissimilar
distributions should have dissimilar compressive representations. In fact, that is precisely the
guarantee of compressive sensing: topological properties of the underlying space are preserved in the
compressive domain [16]. Figure 4 illustrates that not only are individual sparse distributions
recoverable despite significant compression, but the topology of the set of all such distributions is
retained.

Figure 3: Reconstruction of sparse posteriors from random perceptron measurements. (A) A sparse
posterior distribution over 10 nodes in a Boltzmann machine is sampled 1000 times, fed to 50 random
perceptrons, and reconstructed by nonnegative ℓ1 minimization. (B) A histogram of the sum of
reconstructed probabilities reveals the small shortfall from a proper normalization of 1. (C) Scatter
plots show reconstructions versus true probabilities. Each box uses different numbers of compressive
measurements M and numbers of samples T. (D) With increasing numbers of compressive
measurements, the mean squared reconstruction error falls to 1/T = 10^{-3}, the limit imposed by
finite sampling.

For this experiment, an input x is drawn from a dictionary of input patterns X ⊂ {+1, −1}^n.
Each pattern in X is a translation of a single binary template x^0 whose elements are generated by
thresholding a noisy sinusoid (Figure 4A): x^0_j = sgn[4 sin(2πj/n) + η_j] with η_j ~ N(0, 1). On
each trial, one of these possible patterns is drawn randomly with equal probability and then
is measured by a noisy process that randomly flips bits with a probability η = 0.35 to give a noisy
pattern r. This process induces a posterior distribution over the possible input patterns

p(x|r) = \frac{1}{Z} p(x) p(r|x) = \frac{1}{Z} p(x) \prod_i p(r_i | x_i) \qquad (15)

= \frac{1}{Z} p(x) \, \eta^{h(x,r)} (1 - \eta)^{n - h(x,r)} \qquad (16)

where Z is a normalization constant and h(x, r) is the Hamming distance between x and r. This posterior is nonzero for all patterns in
the dictionary. The noise level and the similarities between the dictionary elements together control
the sparseness.

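The posterior over dictionary patterns is straightforward to compute from Hamming distances. The following Python sketch assumes a uniform prior and independent bit flips with probability η, so the likelihood of a noisy pattern r given x is η^h (1 − η)^(n − h) with h the Hamming distance; the six-element template is hypothetical and much smaller than the dictionary used in the experiment.

```python
def hamming(x, r):
    """Number of positions at which x and r differ."""
    return sum(1 for a, b in zip(x, r) if a != b)

def posterior(dictionary, r, eta):
    """Posterior over dictionary patterns given noisy r, with a uniform prior."""
    n = len(r)
    likes = [eta ** hamming(x, r) * (1.0 - eta) ** (n - hamming(x, r))
             for x in dictionary]
    Z = sum(likes)  # normalization constant
    return [like / Z for like in likes]

template = (+1, +1, +1, -1, -1, -1)  # hypothetical binary template
dictionary = [template[k:] + template[:k] for k in range(len(template))]

r = dictionary[2]                    # an observation with no flipped bits
post = posterior(dictionary, r, eta=0.35)
print([round(q, 3) for q in post])
```

Because η < 1/2, the likelihood decreases with Hamming distance, so the pattern that matches r exactly receives the largest posterior probability, while every other translation keeps a nonzero probability.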

Figure 4: Nonlinear embeddings of a family of probability distributions with a translation symmetry.
(A) The process of generating posterior distributions: (i) A set of 100 possible patterns is generated
as cyclic translations of a binary pattern (only 9 shown). With uniform probability, one of these patterns
is selected (ii), and a noisy version is obtained by randomly flipping bits with probability 0.35
(iii). From such noisy patterns, an observer can infer posterior probability distributions over possible
inputs (iv). (B) The set of posteriors from 1000 iterations of this process is nonlinearly mapped [17]
from 100 dimensions to 2 dimensions. Each point represents one posterior and is colored according
to the actual pattern from which the noisy observations were made. The permutation symmetry of
this process is revealed as a circle in this mapping. (C) This circular structure is retained even after
each posterior is compressed into the mean output of 10 random perceptrons.

One thousand trials of this process generate samples from the set of all possible posterior distributions.
Just as the underlying set of inputs has a translation symmetry, the set of all possible posterior
distributions has a cyclic permutation symmetry. This symmetry can be revealed by a nonlinear
embedding [17] of the set of posteriors into two dimensions (Figure 4B).

Compressive sensing of these posteriors by 10 random perceptrons produces a much lower-dimensional
embedding that preserves this symmetry. Figure 4C shows the same nonlinear embedding
algorithm applied to the reduced representation, and one sees the same topological pattern.
In compressive sensing, similarity is measured by Euclidean distance. When applied to probability
distributions it will be interesting to examine instead how well information-geometric measures like
the Kullback-Leibler divergence are preserved under this dimensionality reduction [18].



4 Discussion 

Probabilistic inference appears to be essential for both animals and machines to perform well on 
complex tasks with natural levels of ambiguity, but it remains unclear how the brain represents 
and manipulates probability. Present population models of neural inference either struggle with
high-dimensional distributions [19] or encode them by hard-to-measure high-order correlations [20].
Here I have proposed an alternative mechanism by which the brain could efficiently represent
probabilities: random perceptrons. In this model, information about probabilities is compressed and
distributed in neural population activity. Amazingly, the brain need not measure any correlations
between the perceptron outputs to capture the joint statistics of the sparse input distribution. Only
the mean activities are required. Figure 2 illustrates one network that implements this new
representation, and many variations on this circuit are possible.

Successful encoding in this compressed representation requires that the input distribution be sparse. 
Posterior distributions over sensory stimuli like natural images are indeed expected to be highly 
sparse: the features are sparse [21], the prior over images is sparse [22], and the likelihood produced
by sensory evidence is usually restrictive, so the posteriors should be even sparser. Still, it
will be important to quantify just how sparse the relevant posteriors are under different conditions. 
This would permit us to predict how neural representations in a fixed population should degrade as 
sensory evidence becomes weaker. 

Brains appear to have a mix of structure and randomness. The results presented here show that 
purely random connections are sufficient to ensure that a sparse probability distribution is properly 
encoded. Surprisingly, more structured connections cannot allow a network with the same computational
elements to encode distributions with substantially fewer neurons, since compressive sensing
is already nearly optimal [6]. On the other hand, some representational structure may make it easier
to perform computations later. Note that unknown randomness is not an impediment to further processing,
as reconstruction can be performed even without explicit knowledge of the random perceptron
measurement matrix [23].

Even in the most convenient representations, inference is generally intractable and requires approximation.
Since compressive sensing preserves the essential geometric relationships of the signal
space, learning and inference based on these relationships may be no harder after the compression,
and could even be more efficient due to the reduced dimensionality. Finding biologically plausible
mechanisms for implementing probabilistic computations in the compressed representation is important
work for the future.



Appendix: Asymptotic orthogonality of random perceptron matrix 

To evaluate the quality of the compressive sensing matrix, we need to ensure that S-sparse vectors
are not projected to zero by the action of A. Here we show that the inner products between the
columns of Ã = A/√M are concentrated around zero by computing their mean and variance over
the ensemble of W_ij ~ N(0, 1) and for random pairs of binary states x and x'. For compactness I
will write w_i for the ith row of the perceptron weight matrix W.

First I compute the mean ⟨C_{xx'}⟩_W and the variance of the inner product C_{xx'} between columns of Ã
for a given pair of states x ≠ x':

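\langle C_{xx'} \rangle_W = \left\langle \sum_i \tilde{A}_i(x) \tilde{A}_i(x') \right\rangle_W = \frac{1}{M} \sum_i \langle \mathrm{sgn}(w_i \cdot x) \, \mathrm{sgn}(w_i \cdot x') \rangle_W \qquad (17)

and since the different w_i are independent and identically distributed, this implies that

\langle C_{xx'} \rangle_W = \langle \mathrm{sgn}(w_i \cdot x) \, \mathrm{sgn}(w_i \cdot x') \rangle_W \qquad (18)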

The n-dimensional half-space in W where sgn(w_i · x) = +1 intersects with the corresponding
half-space for x' in a wedge-shaped region with an angle of π − θ, where
θ = cos^{-1}(x · x' / (||x||_{ℓ2} ||x'||_{ℓ2})) is the angle between x and x'. This
angle is related to the Hamming distance h = h(x, x'):

\theta(h) = \cos^{-1}(x \cdot x'/n) = \cos^{-1}(1 - 2h/n) \qquad (19)

The signs of w_i · x and w_i · x' agree within this wedge region and its reflection about W = 0, and
disagree in the supplementary wedges of angle θ. The mean inner product is therefore

\langle C_{xx'} \rangle_W = (+1) \, P[\mathrm{sgn}(w_i \cdot x) = \mathrm{sgn}(w_i \cdot x')] + \qquad (20)
(-1) \, P[\mathrm{sgn}(w_i \cdot x) \ne \mathrm{sgn}(w_i \cdot x')] \qquad (21)
= 1 - 2\theta(h)/\pi \qquad (22)
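The predicted mean inner product, 1 − 2θ(h)/π with θ(h) = cos^{-1}(1 − 2h/n), can be checked with a quick Monte Carlo simulation. In this Python sketch the values n = 20, h = 5, the number of trials, and the function names are arbitrary illustrative choices; the simulated average over Gaussian weight vectors should agree with the prediction to within sampling error.

```python
import math
import random

def sgn(u):
    return 1 if u >= 0 else -1

def predicted_mean(n, h):
    """1 - 2*theta/pi with theta = arccos(1 - 2h/n)."""
    theta = math.acos(1.0 - 2.0 * h / n)
    return 1.0 - 2.0 * theta / math.pi

def simulated_mean(x, xp, trials, rng):
    """Average of sgn(w.x) * sgn(w.x') over random Gaussian weight vectors w."""
    total = 0
    for _ in range(trials):
        w = [rng.gauss(0.0, 1.0) for _ in x]
        dot_x = sum(a * b for a, b in zip(w, x))
        dot_xp = sum(a * b for a, b in zip(w, xp))
        total += sgn(dot_x) * sgn(dot_xp)
    return total / trials

n, h = 20, 5
x = [1] * n
xp = [-1] * h + [1] * (n - h)   # differs from x in exactly h positions
sim = simulated_mean(x, xp, 20000, random.Random(0))
print(predicted_mean(n, h), sim)  # prediction is 1/3 for these values
```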

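The variance of C_{xx'} caused by variability in W is given by

V_{xx'} = \langle C_{xx'}^2 \rangle_W - \langle C_{xx'} \rangle_W^2 \qquad (23)
= \sum_i \langle \tilde{A}_i^2(x) \tilde{A}_i^2(x') \rangle_W + \sum_{i \ne j} \langle \tilde{A}_i(x) \tilde{A}_i(x') \tilde{A}_j(x) \tilde{A}_j(x') \rangle_W - \langle C_{xx'} \rangle_W^2 \qquad (24)
= \sum_i \left\langle \frac{\mathrm{sgn}(w_i \cdot x)^2 \, \mathrm{sgn}(w_i \cdot x')^2}{M^2} \right\rangle_W + \sum_{i \ne j} \left\langle \frac{\mathrm{sgn}(w_i \cdot x) \, \mathrm{sgn}(w_i \cdot x')}{M} \right\rangle_W \left\langle \frac{\mathrm{sgn}(w_j \cdot x) \, \mathrm{sgn}(w_j \cdot x')}{M} \right\rangle_W - \langle C_{xx'} \rangle_W^2 \qquad (25)
= \frac{M}{M^2} + \frac{M^2 - M}{M^2} \left( 1 - 2\theta(h)/\pi \right)^2 - \langle C_{xx'} \rangle_W^2 \qquad (26)
= \frac{1}{M} \left( 1 - \left( 1 - \frac{2}{\pi} \theta(h(x,x')) \right)^2 \right) \qquad (27)

This variance falls with M, so for large numbers of measurements M the inner products between
columns concentrate around the various state-dependent mean values (22).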



Next I consider the diversity of inner products for different pairs (x, x') of binary state vectors, in the
limit of large M so that the diversity is dominated by variations over the different pairs. The mean
inner product depends only on the Hamming distance h between x and x', which for sparse signals
with random support has a binomial distribution, p(h) = \binom{n}{h} 2^{-n}, with mean n/2 and variance n/4.

Designating by an overbar the average over randomly chosen states, the mean \bar{C} and variance \overline{\delta C^2}
of the inner product are

\bar{C} = \overline{\langle C_{xx'} \rangle_W} = 1 - \frac{2}{\pi} \cos^{-1}\left( 1 - \frac{2\bar{h}}{n} \right) = 0 \qquad (28)

\overline{\delta C^2} = \left( \frac{\partial C}{\partial h} \right)^2_{h = n/2} \frac{n}{4} = \frac{16}{\pi^2 n^2} \cdot \frac{n}{4} = \frac{4}{\pi^2 n} \qquad (29)
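These two averages can be verified numerically by summing over the binomial distribution of Hamming distances. The Python sketch below uses C(h) = 1 − (2/π) cos^{-1}(1 − 2h/n) and compares the exact binomial variance with the linearized estimate 4/(π² n); the choice n = 100 and the function names are assumptions for illustration, and the sum includes the h = 0 term, whose weight 2^{-n} is negligible.

```python
import math

def C(h, n):
    """Mean inner product as a function of Hamming distance h."""
    return 1.0 - (2.0 / math.pi) * math.acos(1.0 - 2.0 * h / n)

def mean_and_variance(n):
    """Mean and variance of C(h) for h ~ Binomial(n, 1/2)."""
    probs = [math.comb(n, h) * 0.5 ** n for h in range(n + 1)]
    mean = sum(p * C(h, n) for h, p in enumerate(probs))
    var = sum(p * (C(h, n) - mean) ** 2 for h, p in enumerate(probs))
    return mean, var

n = 100
mean, var = mean_and_variance(n)
approx = 4.0 / (math.pi ** 2 * n)
print(mean, var, approx)
```

The mean vanishes by the symmetry C(h) = −C(n − h), and the exact variance should exceed the linearized value only by a relative correction of order 1/n.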

This proves that in the limit of large n and M, the columns of the random perceptron measurement
matrix have inner products that concentrate around 0. The Gramian of Ã is just a matrix of many
such inner products, and it approaches the identity almost surely, Ã^T Ã → I. Consequently, with enough
measurements a sparse signal can be recovered perfectly according to RIPless compressive sensing
[4]. Future work will determine how this Gramian matrix behaves for finite n and M, which will
determine the number of measurements required in practice to capture a signal of a given sparseness.